Scaling up top-K cosine similarity search

نویسندگان

  • Shiwei Zhu
  • Junjie Wu
  • Hui Xiong
  • Guoping Xia
چکیده

Article history: Received 21 September 2009 Received in revised form 23 August 2010 Accepted 23 August 2010 Available online 8 September 2010 Recent years have witnessed an increased interest in computing cosine similarity in many application domains. Most previous studies require the specification of a minimum similarity threshold to perform the cosine similarity computation. However, it is usually difficult for users to provide an appropriate threshold in practice. Instead, in this paper, we propose to search top-K strongly correlated pairs of objects as measured by the cosine similarity. Specifically, we first identify the monotone property of an upper bound of the cosine measure and exploit a diagonal traversal strategy for developing a TOP-DATA algorithm. In addition, we observe that a diagonal traversal strategy usually leads to more I/O costs. Therefore, we develop a max-first traversal strategy and propose a TOP-MATA algorithm. A theoretical analysis shows that TOPMATA has the advantages of saving the computations for false-positive item pairs and can significantly reduce I/O costs. Finally, experimental results demonstrate the computational efficiencies of both TOP-DATA and TOP-MATA algorithms. Also, we show that TOP-MATA is particularly scalable for large-scale data sets with a large number of items. © 2010 Elsevier B.V. All rights reserved.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Scaling up all pairs similarity search pdf

Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity.ABSTRACT. Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity. Scaling up all pairs similarity search, Published by ACM. The problem of finding a...

متن کامل

Cosine Similarity Search with Multi Index Hashing

Due to rapid development of the Internet, recent years have witnessed an explosion in the rate of data generation. Dealing with data at current scales brings up unprecedented challenges. From the algorithmic view point, executing existing linear algorithms in information retrieval and machine learning on such tremendous amounts of data incur intolerable computational and storage costs. To addre...

متن کامل

Dynamic Multi-keyword Top-k Ranked Search over Encrypted Cloud Data

Nowadays, more and more people are motivated to outsource their local data to public cloud servers for great convenience and reduced costs in data management. But in consideration of privacy issues, sensitive data should be encrypted before outsourcing, which obsoletes traditional data utilization like keyword-based document retrieval. In this paper, we present a secure and efficient multi-keyw...

متن کامل

Scaling up cosine interesting pattern discovery: A depth-first method

This paper presents an efficient algorithm called CosMinert for interesting pattern discovery. The widely used cosine similarity, found to possess the null-invariance property and the anti-cross-support-pattern property, is adopted as the interestingness measure in CosMinert . CosMinert is generally an FP-growth-like depth-first traversal algorithm that rests on an important property of the cos...

متن کامل

SimDex: Exploiting Model Similarity in Exact Matrix Factorization Recommendations

We present SIMDEX, a new technique for serving exact top-K recommendations on matrix factorization models that measures and optimizes for the similarity between users in the model. Previous serving techniques presume a high degree of similarity (e.g., L2 or cosine distance) among users and/or items in MF models; however, as we demonstrate, the most accurate models are not guaranteed to exhibit ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Data Knowl. Eng.

دوره 70  شماره 

صفحات  -

تاریخ انتشار 2011